Abstract

An interactive framework for soft segmentation and matting of natural images and videos is presented in this paper. The proposed technique is based on the optimal, linear time, computation of weighted geodesic distances to the user-provided scribbles, from which the whole data is automatically segmented. The weights are based on spatial and/or temporal gradients, without explicit optical flow or any advanced and often computationally expensive feature detectors. These could be naturally added to the proposed framework as well if desired, in the form of weights in the geodesic distances. A localized refinement step follows this fast segmentation in order to accurately compute the corresponding matte function. Additional constraints in the distance definition make it possible to efficiently handle occlusions such as people or objects crossing each other in a video sequence. The presentation of the framework is complemented with numerous and diverse examples, including extraction of moving foreground from dynamic background, and comparisons with the recent literature.


1.  Introduction 

The segmentation of natural images and videos is one of the most fundamental and challenging problems in image processing. One of its applications is to extract the foreground object (or object of interest) out of the cluttered background and, for example, composite it onto a new background without visual artifacts (see also [4] for additional applications in video). For complex images, as well as subjective applications, there can be more than one interpretation of the foreground or objects of interest (in the absence of higher level knowledge), thus making the task ill-posed and ambiguous. It is often imperative then to incorporate some user intervention, which encodes prior information, into the process. Specifically, the user can draw rough scribbles labeling the regions of interest and then the image/video is automatically segmented. The user is allowed to add more scribbles to achieve the ideal result, although of course the goal is to minimize the user effort as much as possible.

* Work supported by ONR, NGA, NSF, DARPA, and ARO.

Closely connected to the segmentation of objects of interest, image and video matting refers to the process of reconstructing the foreground/background components and the alpha value (transparency) of each pixel. This is important for applications such as extracting hair strands or blurry edges, as well as for compositing. Being inherently under-constrained (solving for three components, F (foreground), B (background), and $\alpha$ (transparency), with only the observed color), the matting problem also requires priors, such as user interactions, which could be in the form of scribbles as in the segmentation task, or a complete trimap.

In this paper, we propose a fast weighted-distance-based technique for image and video segmentation and matting from very few and roughly placed user scribbles (often just one scribble for the foreground and one for the background). The distance (geodesic) computation is linear in time, and thereby optimal (with minimal memory requirements as well). The weights are based on simple properties such as spatial and temporal gradients, while more sophisticated features can be naturally included as well. The proposed framework can handle diverse data, including dynamic background, moving cameras, and objects crossing each other in the video.

Following a brief literature review in Section 2, we describe the framework for segmenting and matting still images in Section 3. Examples and comparison with the literature are presented in that section as well. Then, in Section 4, we extend it to video applications, where a long video can be processed with little user interaction. We explain how to add constraints to the distance computation to handle moving objects occluding each other, e.g., people/objects crossing each other. We illustrate our method with additional video examples in Section 5, and conclude and discuss future research in Section 6. Before proceeding, let us explicitly present the key attributes of the proposed framework:

1. It is based on weighted distance functions (geodesics), thereby solving a first order geometric Hamilton-Jacobi equation in computationally optimal linear time. This makes the proposed framework natural for user-interactive processing of images and videos.

2. It produces very good, state-of-the-art results, with very few user-provided scribbles and very simple attributes defining the weights in the distance computation. We often use just a couple of rough scribbles for still images (one for the foreground and one for the background) and scribble one frame every 70 or so for videos.

3.  It  applies  to  a  large  class  of  natural  data,  and  since  it 
avoids  off-line  learning,  it  is  not  limited  to  pre-observed  and 
classified  classes  and  to  the  availability  of  ground-truth  and 
hand  segmented  data. 

4.  It  can  handle  dynamic  background  in  video  as  well  as 
crossing  objects  of  interest. 

5.  The  framework  is  general  so  that  additional  attributes 
can  be  naturally  included  in  the  weights  for  the  geodesic 
distances  if  so  required  for  a  particular  type  of  data. 

2.  Related  work 

One important class of related works is based on energy formulations which are minimized via discrete optimization techniques. The pioneering graph cuts technique, [5], addresses the foreground/background interactive segmentation in still images via max-flow/min-cut energy minimization. The energy balances between the probability of pixels belonging to the foreground (likelihood) and the edge contrast, imposing regularization. The user-provided scribbles collect statistical information on pixels and also serve as hard constraints. The Grabcut algorithm, [14], further simplifies the user interaction. Scribbles can be interactively added to improve the initial segmentation. Full color statistics are used, modeled as mixtures of Gaussians (here, in contrast, we use fast kernel density estimation), and these are updated as the segmentation progresses. This can help but also hurt by propagating segmentation errors. Very good and fast results were demonstrated with this technique. A number of methods have been proposed extending this framework, aiming at devising more sophisticated energy formulations and at extending it to higher dimensions (video). The Bilayer approach, [8], segments videos with basically static background. It incorporates an additional second order temporal transition prior term and a motion likelihood term. Each frame is segmented via graph cuts, conditionally dependent on the previous two frames. Although excellent results are reported for a particular type of videos, this method makes assumptions about the different behaviors of foreground and background pixels and deals with videos with mostly static backgrounds (they do permit a moving object in the background as long as it is different enough from the foreground). Moreover, it needs to learn the motion statistics, which is very useful as they have cleverly incorporated it in their system, but requires the availability of pre-segmented ground-truth training data and of video classes (to train and apply with videos having the same type of motion).

Interactive video cutout, [18], presents a system where the user draws scribbles in 3D space. A hierarchical mean-shift preprocess is employed to cluster pixels into supernodes, which greatly reduces the computation of the min-cut problem. In [9], the author uses random walks for soft image segmentation. Each pixel is assigned the label with maximal probability that a random walker reaches it when starting from the corresponding scribbles. The authors of [20] propose an MRF framework to solve segmentation and matting simultaneously. The basic idea is to minimize the fitting error of the matte while maintaining its smoothness. The uncertainties (0 for the scribbles and 1 for all unknown pixels) are propagated to the rest of the image using belief propagation. Once the alpha values are found, the F and B components are estimated. In [1  ], a local linear relation between the alpha values and image intensities is assumed, that is, the pixel’s alpha value can be immediately determined in a local region if its intensity is known. The matting problem is solved by minimizing a cost function combining the prediction error, the regularization of alpha values, and the user-supplied scribbles which indicate constraints to the optimization problem.

Poisson matting, []  ], and Bayesian matting, [7], are two important matting techniques that use trimaps as inputs. Poisson matting computes the alpha matte by solving the second order Poisson equation with Dirichlet boundary conditions.¹ An assumption is made by neglecting the gradients of F and B, considering the matte gradient proportional to the image gradient. Additional operations are performed to adjust to local regions. Bayesian matting simultaneously estimates F, B, and $\alpha$ by maximizing a posterior probability. For each pixel in the trimap's unknown region, it models the known F and B colors around it as a mixture of oriented Gaussians in color space (again, we use fast kernel densities instead). An (F, B, $\alpha$) triplet is computed as the one that most probably generates the observed color of that pixel. This technique is applied to videos in [6], where the trimap is temporally propagated using optical flow and the matte is pulled out individually in each frame by the Bayesian matting algorithm. Explicit optical flow is not used in our method, although it could be incorporated as part of the weights in the geodesic computation.

After this paper was submitted for publication, a few additional matting techniques have been published. The spectral matting technique, [10], automatically computes a set of soft matting components via a linear transformation of the smallest eigenvectors of the matting Laplacian matrix [12]. These components are then selected and grouped into semantically reasonable mattes either in an unsupervised or supervised fashion. The main drawback of this algorithm is its high computational cost: it takes several minutes to compute the matting components for small sized images. In addition, it is not intuitive where to place the constraints. The authors of [19] proposed an improved color sampling method for natural image matting, and demonstrated very good performance. The authors in [17] implemented an interface for interactive realtime matting. The user roughly tracks the boundary with a self-adjustable brush. As in [19], the matte is pulled out in local regions, solving a soft graph-labeling problem. Flash cut, [1  ], extracts the foreground layers of flash/no-flash image pairs, using the prior information that only the foreground is significantly brightened. This information is incorporated in a graph cut energy framework. The segmentation algorithm is shown to tolerate some amount of foreground motion and camera shake.

¹ Note that, in contrast with this, we solve a first order Hamilton-Jacobi equation, which is computationally more efficient.

Our work is inspired by [23], where the authors, following [11], show how to use distance functions for image colorization. As here, these distances are optimally computed in linear time [22]. This was then extended in [13] for segmentation. In contrast with this work, we use significantly fewer scribbles per image (thanks in part to a more efficient modeling of the corresponding probability distribution functions), see Figure 1, extend the work to video, and also produce explicit mattes (F, B, and $\alpha$).


Figure 1. Figures (a)-(d) show the user inputs and results from [13]. Figures (e)-(h) correspond to the new inputs and results for the same images, leading to better results with fewer scribbles.


3.  General  framework:  Still  images 

As discussed in the introduction, our algorithm starts from two types of user-provided scribbles, $\mathcal{F}$ for foreground and $\mathcal{B}$ for background, roughly placed across the main regions of interest. Now the problem is how to learn from them and propagate this prior information/labeling to the entire image.

Figure 2. (a) A hard segmentation (white curve) is quickly found by a few scribbles. (b) Automatically generated trimap, a narrow band around the white curve, and new automatically generated local scribbles (borders of the band). (c) Obtained segmentation and alpha matting. (d) $P_{\mathcal{F}}(x)$. Dark indicates low probabilities and white high probabilities. (Note that this is not the final alpha matte.) (e) $D_{\mathcal{F}}(x)$. (f) $D_{\mathcal{B}}(x)$. Blue indicates low distances and red high distances.

We use the geodesic distance from these scribbles to classify the pixels, labeling them $\mathcal{F}$ or $\mathcal{B}$. The geodesic distance $d(x)$ is simply the smallest integral of a weight function over all paths from the scribbles to $x$. Specifically, let $\Omega_{\mathcal{F}}$ be the set of pixels with label/scribble $\mathcal{F}$ and $\Omega_{\mathcal{B}}$ those corresponding to the background scribble $\mathcal{B}$. The weighted distance (geodesic) from each of the two scribbles for every pixel $x$ is then computed as

$$D_l(x) = \min_{s \in \Omega_l} d(s, x), \quad l \in \{\mathcal{F}, \mathcal{B}\}, \qquad (1)$$

where

$$d(s_1, s_2) := \min_{C_{s_1,s_2}} \int_0^1 |W \cdot \dot{C}_{s_1,s_2}(p)| \, dp, \qquad (2)$$

where $C_{s_1,s_2}(p)$ is a path connecting the pixels $s_1$, $s_2$ (for $p = 0$ and $p = 1$ respectively). The weights $W$ are set to the gradient of the likelihood that a pixel belongs to the foreground (resp. background), i.e., $W = \nabla P_{\mathcal{F}}(x)$. This likelihood is obtained from the samples on the provided scribbles in Luv color space, i.e.,

$$P_{\mathcal{F}}(x) = \frac{Pr(x|\mathcal{F})}{Pr(x|\mathcal{F}) + Pr(x|\mathcal{B})},$$

where $Pr(x|\mathcal{F})$ is the color PDF of $\Omega_{\mathcal{F}}$, obtained via the fast kernel density estimation ([21]) (same process for the background PDF). A pixel is close in this metric to a scribble in the sense that there exists a path along which the likelihood function does not change much (see Figure 2(d)). Following [22], we can efficiently compute the distances, in optimal linear time, and assign each pixel to the label with the shorter distance. The user can progressively add new scribbles to achieve the desired result, although often a single scribble for the foreground and one for the background (regardless of how cluttered it is) is sufficient. If a refinement step is needed, a narrow band is spanned across the current boundaries (see Figure 2(b)), and its borders serve as new $\mathcal{F}$ and $\mathcal{B}$ scribbles, thereby reducing the computational cost just to a few pixels in the band, while at the same time refining the likelihood functions and locally adapting them to the region of interest.
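As a rough illustration of Equations (1) and (2), the following Python sketch classifies the pixels of a small grayscale image. It is only a sketch under several assumptions that are not part of the framework above: a plain Dijkstra search on the 4-connected pixel grid stands in for the linear-time algorithm of [22], a naive Gaussian kernel density estimate on scalar intensities stands in for the fast kernel density estimation of [21] in Luv space, and the names `kde`, `foreground_likelihood`, `geodesic_distance`, `segment`, as well as the `bandwidth` parameter, are hypothetical.

```python
import heapq
import math


def kde(samples, value, bandwidth=0.1):
    """Naive Gaussian kernel density estimate at `value` (assumed stand-in for [21])."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((value - s) / bandwidth) ** 2) for s in samples) / norm


def foreground_likelihood(image, fg_seeds, bg_seeds, bandwidth=0.1):
    """P_F(x) = Pr(x|F) / (Pr(x|F) + Pr(x|B)) for every pixel of a 2D list."""
    fg = [image[r][c] for r, c in fg_seeds]
    bg = [image[r][c] for r, c in bg_seeds]
    result = []
    for row in image:
        out_row = []
        for v in row:
            pf, pb = kde(fg, v, bandwidth), kde(bg, v, bandwidth)
            # If both densities underflow to zero, fall back to an uninformative 0.5.
            out_row.append(pf / (pf + pb) if pf + pb > 0 else 0.5)
        result.append(out_row)
    return result


def geodesic_distance(likelihood, seeds):
    """D(x): cheapest path cost from any seed pixel.

    The cost of a step between 4-neighbors p and q is |P(p) - P(q)|,
    a discrete counterpart of integrating |grad P| along the path.
    """
    rows, cols = len(likelihood), len(likelihood[0])
    dist = [[math.inf] * cols for _ in range(rows)]
    heap = []
    for r, c in seeds:
        dist[r][c] = 0.0
        heap.append((0.0, r, c))
    heapq.heapify(heap)
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > dist[r][c]:
            continue  # stale queue entry
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + abs(likelihood[nr][nc] - likelihood[r][c])
                if nd < dist[nr][nc]:
                    dist[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return dist


def segment(image, fg_seeds, bg_seeds):
    """Label a pixel foreground (True) when it is geodesically closer to the F seeds."""
    p_f = foreground_likelihood(image, fg_seeds, bg_seeds)
    # With two labels P_B = 1 - P_F, so |grad P_B| = |grad P_F| and one map serves both.
    d_f = geodesic_distance(p_f, fg_seeds)
    d_b = geodesic_distance(p_f, bg_seeds)
    return [[d_f[r][c] < d_b[r][c] for c in range(len(image[0]))]
            for r in range(len(image))]


# Toy usage: one foreground seed on the bright side, one background seed on the dark side.
image = [[0.9, 0.8, 0.2, 0.1],
         [0.9, 0.8, 0.2, 0.1]]
labels = segment(image, fg_seeds=[(0, 0)], bg_seeds=[(0, 3)])
```

The priority queue makes this sketch O(n log n) in the number of pixels rather than linear; it is meant to show what is computed, not how [22] computes it.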

Once this distance has been obtained, the alpha channel inside the band is explicitly computed as

$$\omega_l(x) = D_l(x)^{-r} \cdot P_l(x), \quad l \in \{\mathcal{F}, \mathcal{B}\}, \qquad (3)$$

$$\alpha(x) = \frac{\omega_{\mathcal{F}}(x)}{\omega_{\mathcal{F}}(x) + \omega_{\mathcal{B}}(x)}, \qquad (4)$$

where $P_l(x)$ is locally recomputed using the feature vector $(L, u, v, \tau)$; here $\tau \in [0, 1]$ parameterizes the band along the boundary (leading to local PDF estimations), and is periodic with period 1 if the curve is closed (see Figure 2(b)). The exponent $r$ controls the smoothness of the edges. When $r = 0$, $\alpha(x) = P_{\mathcal{F}}(x)$; when $r \to \infty$, $\alpha(x)$ becomes a hard segmentation (typically $0 < r < 2$ in our examples). This alpha matte combines the weighted distance (measuring how “close” the pixel is to the scribble) and the probability based on the fast kernel density estimation (measuring how probable its color is). Note that regularization, e.g., anisotropic diffusion of $\alpha$, can be applied inside the band as well if needed. Since this is done locally, virtually no computational cost is added.
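To make the role of the exponent in Equations (3) and (4) concrete, here is a minimal per-pixel sketch in Python. The function name `alpha_value` and the small constant `eps` are assumptions of this sketch; `eps` is added only to keep the negative power finite on scribble pixels, where the distance is zero.

```python
def alpha_value(d_f, d_b, p_f, p_b, r=1.0, eps=1e-8):
    """Alpha for one pixel from its distances (d_f, d_b) and likelihoods (p_f, p_b).

    eps is an assumption of this sketch; it is intended for the moderate
    exponents reported in the text, since very large r can overflow.
    """
    w_f = (d_f + eps) ** (-r) * p_f  # omega_F, Equation (3)
    w_b = (d_b + eps) ** (-r) * p_b  # omega_B, Equation (3)
    total = w_f + w_b
    # Equation (4); 0.5 is returned when both weights vanish.
    return w_f / total if total > 0 else 0.5


# r = 0 ignores the distances and returns the normalized likelihood.
soft = alpha_value(0.3, 2.0, p_f=0.3, p_b=0.7, r=0.0)
# A large r lets the shorter distance dominate, approaching a hard label.
hard = alpha_value(0.1, 1.0, p_f=0.5, p_b=0.5, r=20.0)
```

With equal likelihoods and equal distances the two weights coincide and the value is 0.5, which is the expected behavior for a pixel lying exactly between the two band borders.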

After the matte $\alpha$ is computed, we follow the method in [20] to estimate the $F_x$ and $B_x$ components (in Luv space) for each pixel $x$ inside the band. We randomly sample the foreground and background colors in the neighborhood of $x$ and use the pair that gives the minimal fitting error:

$$(F_x, B_x) = \arg\min_{F_i, B_j} \| F_i \alpha_x + B_j (1 - \alpha_x) - I_x \|, \qquad (5)$$

where $i \in N(x) \cap \Omega_{\mathcal{F}}$, $j \in N(x) \cap \Omega_{\mathcal{B}}$, $F_i$, $B_j$ are foreground and background colors sampled on the (band boundary) scribbles within the window $N(x)$ centered at $x$, and $I_x$ is the observed color.
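The selection in Equation (5) can be sketched as follows in Python. This is an illustrative sketch, not the sampling procedure of [20]: it exhaustively scores every pair from two already collected candidate lists instead of sampling at random, and the name `estimate_fb` is hypothetical.

```python
import math


def estimate_fb(observed, alpha, fg_samples, bg_samples):
    """Return the (F, B) pair minimizing ||F*alpha + B*(1 - alpha) - I||.

    observed   -- color I_x of the pixel, as a tuple of channels
    alpha      -- matte value alpha_x of the pixel
    fg_samples -- candidate foreground colors from the window N(x)
    bg_samples -- candidate background colors from the window N(x)
    """
    best_error, best_pair = math.inf, None
    for f in fg_samples:
        for b in bg_samples:
            error = math.sqrt(sum(
                (alpha * fc + (1.0 - alpha) * bc - oc) ** 2
                for fc, bc, oc in zip(f, b, observed)))
            if error < best_error:
                best_error, best_pair = error, (f, b)
    return best_pair


# Toy usage: a mid-gray pixel with alpha 0.5 is best explained by white over black.
pair = estimate_fb(
    observed=(0.5, 0.5, 0.5), alpha=0.5,
    fg_samples=[(1.0, 1.0, 1.0), (0.8, 0.2, 0.2)],
    bg_samples=[(0.0, 0.0, 0.0), (0.3, 0.3, 0.3)])
```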

With these components, we can now paste the object onto a new background if desired, with no noticeable visual artifacts, using the simple matting equation $C^* = F_x \alpha_x + B^* (1 - \alpha_x)$, where the composite color $C^*$ is a linear combination of the foreground color $F_x$ and the new background color $B^*$ for every pixel $x$ in the image.
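The compositing step amounts to one blend per pixel and channel. A minimal Python sketch, with the hypothetical name `composite` and colors given as tuples of channels:

```python
def composite(foreground, new_background, alpha):
    """Blend one pixel: C* = F * alpha + B* * (1 - alpha), channel by channel."""
    return tuple(alpha * f + (1.0 - alpha) * b
                 for f, b in zip(foreground, new_background))


# A pixel that is one quarter foreground (red) over a blue background.
pixel = composite((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), alpha=0.25)
```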

Figure 3 shows our results for still images. Note how simple scribbles can handle cluttered and diverse images. Figure 4 presents comparisons with the work in [20], Photoshop Extract Filter [1], Photoshop CS3 Quick Selection & Refine Edge tools [3], Corel Knockout2 [2], and Spectral Matting [10] (note how our proposed approach needs significantly fewer scribbles).


Figure 3. Left column: original images with user-provided scribbles. Blue for foreground and green for background. Middle column: Computed alpha matte. Right column: Foregrounds pasted on blue backgrounds (blue (constant) backgrounds are selected since they often permit much more careful inspection of the results than pasting on cluttered backgrounds).


4.  Interactive  video  segmentation  and  matting 

The framework described above is now extended to videos, modeled as 3D images, in which every pixel has six neighbors, four spatial and two temporal (except the ones on the frame borders). The scribbles, drawn on one or several frames, propagate throughout the whole video by weighted distances in spatio-temporal space. In particular, spatial and temporal gradients of the likelihood function are used to define the weight $W$ in the geodesic computation in Equation (2). Note that there is no explicit use of optical flow in the framework (or motion models as in the works described in Section 2), thereby not only simplifying the computations but also making it possible to deal with dynamic background and not limiting the work to pre-specified motion classes. As we will see in the experimental section, this simple model is already very useful for numerous scenarios. We now introduce some additional extensions to make it more general.

4.1.  Constrained  spatio-temporal  distance 

In still images, a single $\mathcal{F}$ scribble and a single $\mathcal{B}$ scribble always return two connected components. This can be easily proved by the triangle inequality property of the distance function (this also helps to prove the robustness of the method with respect to the exact placement of the scribbles, see [23]). If the user marks a circle of $\mathcal{B}$ scribble around the object, all the exterior region will be classified as background. However, this is no longer guaranteed in the 3D spatio-temporal case. Consider the simple scenario in Figure 5. Two objects with similar color/feature distributions move towards each other, cross, and split apart. The inside of the tube has low distances to the $\mathcal{F}$ scribble (shown in red). The $\mathcal{F}$ scribble in object A propagates to the frames with occlusions, and then backwards to object B (B refers now to the second object in Figure 5 and not to the background value). Although the user might intend to separate object A as foreground in the initial frame, object B is mistakenly cut out because of the connectivity in 3D space (such connectivity doesn’t occur in still images). This phenomenon happens when undesired objects in the background touch the foreground in a certain frame, and the error spreads temporally throughout all frames.

Figure 4. Comparison of our results (first two rows, first column) with [20] (first two rows, second column), Photoshop Extract Filter [1] (first two rows, third column), Photoshop CS3 Quick Selection & Refine Edge tools [3] (last two rows, first column), Corel Knockout2 [2] (last two rows, second column), and Spectral Matting [10] (last two rows, last column). The first and third rows are the user inputs. The second and last rows are the corresponding results on blue background. ([1] and [2] require complete trimaps.)

We address this problem with very limited extra computation. To eliminate the branch formed by the undesired object before occlusion, we simply constrain the propagation to be temporally non-decreasing, and Equation (2) is replaced by:

$$d(s_1, s_2) := \min_{C_{s_1,s_2}} \int_0^1 |W \cdot \dot{C}_{s_1,s_2}(p)| \, dp, \quad \text{s.t. } t_1 \le t_2 \text{ if } p_1 \le p_2, \qquad (6)$$

where $p_1, p_2 \in [0, 1]$ indicate any two positions on $C_{s_1,s_2}(p)$ and $t_1$, $t_2$ are their corresponding time coordinates. In other words, $d(s_1, s_2)$ is minimized among the paths that temporally go forwards. Of course, we can also constrain the distance function in the opposite direction; this is the same definition applied to the video played in reverse.

In the discrete scenario, the temporal links (the links that connect temporal neighbors) are replaced by directed links, i.e., the weight of going backwards in time is set to be infinity. This simple modification leads to the correct segmentation before the occlusion, but confusion might still exist after the occlusion (Figure 5(c)). We can further remove the wrong branch using the same approach, but now in the opposite direction. This can be done by specifying a point in the desired tube at a later time, letting it propagate backwards within the tubes, constrained to move only backwards. Figure 5 illustrates the process. As a result, the ambiguity is removed in frames where the objects are disconnected within the frame.
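The directed temporal links can be sketched in Python as follows. As before, this is a sketch under assumptions rather than the implementation used here: Dijkstra's algorithm stands in for the linear-time method of [22], the input is assumed to be a precomputed likelihood volume indexed as `volume[t][row][col]`, and the name `constrained_distance` and its `direction` parameter are hypothetical. Setting the cost of the forbidden temporal link to infinity is realized by simply never visiting that neighbor.

```python
import heapq
import math


def constrained_distance(volume, seeds, direction=1):
    """Geodesic distance in a video volume with one-way temporal links.

    direction=+1 only allows moving forward in time (Equation (6));
    direction=-1 only allows moving backward in time.
    """
    frames, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    dist = [[[math.inf] * cols for _ in range(rows)] for _ in range(frames)]
    heap = []
    for t, r, c in seeds:
        dist[t][r][c] = 0.0
        heap.append((0.0, t, r, c))
    heapq.heapify(heap)
    while heap:
        d, t, r, c = heapq.heappop(heap)
        if d > dist[t][r][c]:
            continue  # stale queue entry
        # Four spatial neighbors plus the single allowed temporal neighbor.
        neighbors = ((t, r + 1, c), (t, r - 1, c), (t, r, c + 1), (t, r, c - 1),
                     (t + direction, r, c))
        for nt, nr, nc in neighbors:
            if 0 <= nt < frames and 0 <= nr < rows and 0 <= nc < cols:
                nd = d + abs(volume[nt][nr][nc] - volume[t][r][c])
                if nd < dist[nt][nr][nc]:
                    dist[nt][nr][nc] = nd
                    heapq.heappush(heap, (nd, nt, nr, nc))
    return dist


# Toy usage: three frames of a uniform 1x2 image, seed placed in the middle frame.
volume = [[[0.0, 0.0]] for _ in range(3)]
forward = constrained_distance(volume, [(1, 0, 0)], direction=1)
backward = constrained_distance(volume, [(1, 0, 0)], direction=-1)
```

In the toy volume the forward pass never reaches frame 0 and the backward pass never reaches frame 2, which mirrors how the two passes remove the branches before and after the occlusion.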

Figure 6 shows the example of two people walking. The user wishes to segment the person initially on the right. The two people are merged as a single object when they cross each other (since they share the features that are used to compute the weighted distance). The second column shows the results using the distance function without the constraint. The wrong segment appears in every frame (again, see Figure 5(a)). The third column shows the result by the constrained distance function. We can see that the error is removed before and after the intersection. Adding scribbles in the intersection frames will manage to separate them also there, see below, but this is omitted in this figure to illustrate the power of the “tubing” effect just described.

4.2.  Interactive  refinement 

For individual frames where occlusion actually happens and cannot be fixed by the “tubing” approach described above, the user simply provides extra scribbles to segment the object. Since the color distribution might be inadequate to differentiate the objects (this is what led to their merge in the first place), we switch to another contrast sensitive weight to be used for the geodesic distance computation in Equation (2). This shows the power of the framework: features can be adapted to the problem at hand. For discrete images, the new feature is defined as $W_{pq} := \| I_p - I_q \|$, where $p$ and $q$ are two adjacent pixels and $I$ is the color vector in Luv space. Figure 7 shows how the user separates the two persons using the new weights.
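A minimal Python sketch of the contrast sensitive weight, assuming colors are given as tuples of channels; the name `contrast_weight` is hypothetical. In a shortest-path computation over the pixel grid, this value would replace the likelihood difference as the cost of stepping between adjacent pixels $p$ and $q$.

```python
import math


def contrast_weight(color_p, color_q):
    """W_pq = ||I_p - I_q||: Euclidean color difference between adjacent pixels."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(color_p, color_q)))


# Identical colors cost nothing to cross; a strong edge is expensive.
flat = contrast_weight((50.0, 10.0, 10.0), (50.0, 10.0, 10.0))
edge = contrast_weight((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```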

5.  Additional  video  experimental  results 

We test our algorithm on three videos of 71, 79 and 78 frames respectively. We mark scribbles on two frames for the video in Figure 8 and just a single frame for videos in Figures 9 and 10. The results are shown in Figures 8, 9 and 10 as image sequences sampled every few frames (please see the videos uploaded with the supplementary material to appreciate the moving camera and dynamic background). The columns correspond to the original frames, alpha matte, composites on a white background, and composites on a new movie.

Figure 5. Tubes in 3D space, where $t_1 < t_2 < t_3$. (a) Although the scribbles in the first frame intend to separate A, the $\mathcal{F}$ scribble (red) reaches the object B by a path in 3D space where both objects A and B overlap. (b) The scribble propagation is constrained to move forward and the branch between $t_1$ and $t_2$ is eliminated. (c) The user specifies a pixel in A at $t_3$ and lets it propagate backwards. The branch of B between $t_2$ and $t_3$ is removed. (d) Result with the proper separation of the object A.

Figure 6. A video example of two people crossing. Left column: original video. Scribbles drawn on the first frame. Middle column: Segmented results using unconstrained distance function. Right column: Segmented results using constrained distance function. See text for details.

Figure 7. (a) Original segmentation obtained by gradients of the PDF. (b) The user adds new scribbles. (c) Segmentation results obtained with the new geodesic distance.

Finally, we compare our approach with the rotoscoping algorithm in [4] for the video in Figure 8 (we only refer to the segmentation/tracking part, which is the contribution of our paper, and not the very nice special effects they show after the segmentation is obtained). Our approach has a number of advantages over this work: (a) We need significantly less user interaction. In [4] the user basically needs to draw the boundaries for all keyframes by hand (about every 10 frames for this video), while our method only requires very few rough scribbles, see Figure 8. (b) We explicitly compute the alpha matte, while [4] gives spline approximations of the detected boundaries (explicit computation of the matte was not in the original goals of [4] for their applications). (c) Our method can adapt to a wide variety of motions while the algorithm in [4] easily loses track of the object, especially when part of the object moves out of the frame, requiring further user intervention. To better illustrate the comparison, we generate the boundaries by thresholding and dilating the alpha matte obtained by our method. A few frames are shown in Figure 11.
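The boundary extraction used for the comparison with [4] (thresholding and dilating the alpha matte) could look roughly like the Python sketch below. The threshold value, the square structuring element, the final step of keeping only the outline of the dilated mask, and the name `matte_boundary` are all assumptions of this sketch, since those details are not specified here.

```python
def matte_boundary(alpha, threshold=0.5, radius=1):
    """Threshold a matte (2D list), dilate the mask, and return its outline."""
    rows, cols = len(alpha), len(alpha[0])
    # Threshold the soft matte into a binary foreground mask.
    mask = [[a >= threshold for a in row] for row in alpha]
    # Dilate with a (2*radius + 1) x (2*radius + 1) square structuring element.
    dilated = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                for nr in range(max(0, r - radius), min(rows, r + radius + 1)):
                    for nc in range(max(0, c - radius), min(cols, c + radius + 1)):
                        dilated[nr][nc] = True
    # Outline: dilated pixels with at least one 4-neighbor outside the dilated mask.
    boundary = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if dilated[r][c]:
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if 0 <= nr < rows and 0 <= nc < cols and not dilated[nr][nc]:
                        boundary[r][c] = True
    return boundary


# Toy usage: a single opaque pixel in the middle of a 5x5 matte.
alpha = [[0.0] * 5 for _ in range(5)]
alpha[2][2] = 1.0
outline = matte_boundary(alpha)
```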
Figure 8. Video example 1 (a total of 71 frames).

Figure 9. Video example 2 (a total of 79 frames).


Figure  10.  Video  example  3.  (a  total  of  78  frames) 


Figure 11. Comparison with the rotoscoping algorithm in [4]. The curves indicate the boundaries. Top row: A few frames for the work in [4], obtained by their provided interface. The small squares are the control points of the splines. Bottom row: Results from our approach, obtaining similar segmentation with significantly less user intervention (see Figure 8).


6.  Conclusions  and  future  work 

We presented a geodesics-based algorithm for (interactive) natural image and video segmentation and matting. We introduced the framework for still images and extended it to video segmentation and matting. We added constraints to the distance function in order to handle objects that cross each other in the video temporal domain. We showed examples illustrating the application of this framework to very different images and videos, including videos with dynamic background and moving cameras. Another application of our approach is to speed up available image matting algorithms (e.g. [20]). A narrow band trimap is quickly generated from a few scribbles, and then a different matting algorithm is applied. Figure 12 shows our method working in conjunction with [20].


Although the proposed framework is general, we mainly exploited weights in the geodesic computation that depend on the pixel value distributions. As such, in this form the algorithm works best when these distributions do not significantly overlap. In principle, this can be solved with enough user interactions, but this could be tedious, and it would be better to solve it by enhancing the features used in deriving the weights. Our current efforts are concentrated on enhancing the features we currently use for weighting the geodesic. Also, we are investigating how to naturally add a regularization term into the model, without having to perform this as a post-processing step as currently done. Results in these directions will be reported elsewhere.